Abstract. Given a string S[1..N] and an integer k, the suffix selection problem is to determine
the kth lexicographically smallest amongst the suffixes S[i..N], 1 ≤ i ≤ N. We study the
suffix selection problem in the cache-aware model that captures the two-level memory inherent
in computing systems, for a cache of limited size M and block size B. The complexity of
interest is the number of block transfers. We present an optimal suffix selection algorithm
in the cache-aware model, requiring O(N/B) block transfers, for any string S over an
unbounded alphabet (where characters can only be compared), under the common tall-cache
assumption (i.e. M = Ω(B^{1+ε}), where ε < 1). Our algorithm beats the bottleneck
bound for permuting an input array to the desired output array, which holds for nearly
any nontrivial problem in hierarchical memory models.
1. Introduction

Background: Selection vs Sorting. A collection of N numbers can be sorted using
O(N log N) comparisons. On the other hand, the famous five-author result [2] from the early
1970s shows that the problem of selection (choosing the kth smallest number) can be
solved using O(N) comparisons in the worst case. Thus, selection is provably simpler than
sorting in the comparison model.
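
As a concrete reference point, the worst-case linear selection of [2] can be sketched in a few lines of Python. This is only a minimal illustration of the median-of-medians idea, not the authors' code; the name `select` and the list-based partitioning are assumptions made for readability.

```python
def select(items, k):
    """Return the k-th smallest element (1-based) via median-of-medians."""
    if not 1 <= k <= len(items):
        raise ValueError("k out of range")
    items = list(items)
    while True:
        if len(items) <= 5:
            return sorted(items)[k - 1]
        # Median of every group of five, then the median of those medians.
        groups = [sorted(items[i:i + 5]) for i in range(0, len(items), 5)]
        medians = [g[len(g) // 2] for g in groups]
        pivot = select(medians, (len(medians) + 1) // 2)
        smaller = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if k <= len(smaller):
            items = smaller
        elif k <= len(smaller) + len(equal):
            return pivot
        else:
            # Discard everything up to the pivot and adjust the rank.
            k -= len(smaller) + len(equal)
            items = [x for x in items if x > pivot]
```

The pivot chosen this way is guaranteed to discard a constant fraction of the elements in each round, which is what yields the O(N) comparison bound.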

Consider a sorting vs selection question for strings. Say S = S[1..N] is a string. The
suffix sorting problem is to sort the suffixes S[i..N], i = 1, ..., N, in the lexicographic
order. In the comparison model, we count the number of character comparisons. Suffix
sorting can be performed with O(N log N) comparisons using a combination of character
sorting and classical data structure of suffix arrays or trees |1H O II]- There is a lower 
bound of Ω(N log N) since sorting suffixes ends up sorting the characters. For the related
suffix selection problem where the goal is to output the kth lexicographically smallest suffix
of S, the result in [6] recently gave an optimal O(N) comparison-based algorithm, thereby
showing that suffix selection is provably simpler than suffix sorting.
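
To make the problem statement concrete, the following Python sketch answers a suffix selection query by sorting all the suffixes. It is only a specification-level baseline (it does far more work than the algorithms discussed here); the function name `kth_suffix_naive` is hypothetical, and the endmarker `$` is assumed to be smaller than every other symbol, as it is in ASCII for lowercase letters.

```python
def kth_suffix_naive(s, k):
    """Return the 1-based start position of the k-th smallest suffix of s."""
    # Sort the starting positions by the suffix they denote.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    return order[k - 1] + 1


if __name__ == "__main__":
    text = "banana$"
    start = kth_suffix_naive(text, 3)
    print(start, text[start - 1:])
```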
The Model. Time-tested architectural approaches to computing systems provide two (or
more) levels of memory: the highest one with a limited amount of fast memory; the lowest 
one with slow but large memory. The CPU can only access input stored on the fastest level. 
Thus, there is a continuous exchange of data between the levels. For cost and performance 
reasons, data is exchanged in fixed-size blocks of contiguous locations. These transfers may 
be triggered automatically like in internal CPU caches, or explicitly, like in the case of 
disks; in either case, more than the number of computing operations executed, the number 
of block transfers required is the actual bottleneck. 

Formally, we consider the model that has two memory levels. The cache level contains 
M locations divided into blocks (or cache lines) of B contiguous locations, and the main 
memory level can be arbitrarily large and is also divided into blocks. The processing unit 
can address the locations of the main memory but it can process only the data residing in 
cache. The algorithms that know and exploit the two parameters M and B, and optimize
the number of block transfers, are cache-aware. This model includes the classical External
Memory model [1] as well as the well-known Ideal-Cache model.
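
The cost measure can be illustrated with a small simulator. The Python sketch below counts block transfers for a sequence of memory accesses, assuming an LRU replacement policy (an assumption of this sketch; the model itself leaves the policy to the algorithm or to an ideal cache). The name `count_block_transfers` is hypothetical.

```python
from collections import OrderedDict


def count_block_transfers(addresses, M, B):
    """Count block loads for an access sequence, cache of M locations, blocks of B."""
    capacity = M // B              # number of blocks that fit in the cache
    cache = OrderedDict()          # block id -> None, ordered from least to most recent
    transfers = 0
    for a in addresses:
        block = a // B
        if block in cache:
            cache.move_to_end(block)       # hit: refresh recency
        else:
            transfers += 1                 # miss: one block transfer
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return transfers


if __name__ == "__main__":
    N, M, B = 1024, 64, 16
    scan = count_block_transfers(range(N), M, B)
    strided = count_block_transfers(
        [j * B + i for i in range(B) for j in range(N // B)], M, B)
    print(scan, strided)
```

A sequential scan touches each block once, whereas the strided order visits the same locations but reloads a block on every access once the working set exceeds the cache.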

Motivation. Suffix selection as a problem is useful in analyzing the order statistics of 
suffixes in a string such as the extremes, medians and outliers, with potential applications 
in bioinformatics and information retrieval. A quick method for finding, say, the suffixes
of rank i(n/10) for each integer i, 0 < i < 10, may be used to partition the space of
suffixes for understanding the string better, load balancing and parallelization. But in 
these applications, such as in bioinformatics, the strings are truly massive and unlikely to 
fit in the fastest levels of memory. Therefore it is natural to analyze them in a hierarchical 
memory model. 

Our primary motivation however is really theoretical. Since the inception of the first
block-based hierarchical memory model ([1],[10]), it has been difficult to obtain "golden
standard" algorithms, i.e., those using just O(N/B) block transfers. Even the simplest
permutation problem (PERM henceforth), where the output is a specified permutation of
the input array, does not have such an algorithm. In the standard RAM model, PERM
can be solved in O(N) time. In both the Ideal-Cache and External Memory models,
the complexity of this problem is denoted PERM(N) = Θ(min{N, (N/B) log_{M/B}(N/B)}).
Nearly any nontrivial problem one can imagine, from list ranking to graph problems such
as Euler tours, DFS and connected components, to sorting and geometric problems, has the
lower bound of PERM(N), even if it takes O(N) time in the RAM model, and therefore
does not meet the "golden standard". Thus the lower bound for PERM is a terrible bottleneck
for block-based hierarchical memory models.
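
To get a feeling for the gap between the "golden standard" and the permutation bound, the two expressions can be evaluated numerically. The Python sketch below ignores all hidden constants; the function names and the sample parameters are illustrative assumptions.

```python
import math


def scan_bound(N, B):
    """The 'golden standard' N/B, constants ignored."""
    return N / B


def perm_bound(N, M, B):
    """min{N, (N/B) log_{M/B}(N/B)}, constants ignored."""
    return min(N, (N / B) * math.log(N / B, M / B))


if __name__ == "__main__":
    N, M, B = 2**30, 2**20, 2**10
    print(scan_bound(N, B), perm_bound(N, M, B))
```

With these sample parameters the logarithmic factor is about 2, so the permutation bound is roughly twice the scanning bound; the factor grows as N/B grows relative to M/B.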

The outstanding question is, much as in the comparison model, is suffix selection provably
simpler than suffix sorting in the block-based hierarchical memory models? Suffix
sorting takes Θ((N/B) log_{M/B}(N/B)) block transfers [5]. Proving any problem to be simpler
than suffix sorting therefore requires one to essentially overcome the PERM bottleneck.

Our Contribution. We present a suffix selection algorithm with optimal cache complexity.
Our algorithm requires Θ(N/B) block transfers, for any string S over an unbounded
alphabet (where characters can only be compared) and under the common tall-cache
assumption, that is M = Ω(B^{1+ε}) with ε < 1. Hence, we meet the "golden standard"; we
beat the PERM bottleneck and consequently, prove that suffix selection is easier than suffix
sorting in block-based hierarchical memory models.

Overview. Our high level strategy for achieving an optimal cache-aware suffix selection 
algorithm consists of two main objectives. 

In the first objective, we want to efficiently reduce the number of candidate suffixes
from N to O(N/B), where we maintain the invariant that the wanted kth smallest suffix
is surely one of the candidate suffixes.

In the second objective, we want to achieve a cache optimal solution for the sparse suffix
selection problem, where we are given a subset of O(N/B) suffixes including also the wanted
kth suffix. To achieve this objective we first find a simpler approach to suffix selection for
the standard comparison model. (The only known linear time suffix selection algorithm for
the comparison model [6] hinges on well-known algorithmic and data structural primitives
whose solutions are inherently cache inefficient.) Then, we modify the simpler comparison-based
suffix selection algorithm to exploit, in a cache-efficient way, the hypothesis that
O(N/B) (known) suffixes are the only plausible candidates.

Map of the paper. We will start by describing the new simple comparison-based suffix selection
algorithm in Section 2. This section is meant to be intuitive. We will use it to derive a
cache-aware algorithm for the sparse suffix selection problem in Section 3. We will present
our optimal cache-aware algorithm for the general suffix selection problem in Section 4.

2. A Simple(r) Linear-Time Suffix Selection Algorithm

We now describe a simple algorithm for selecting the kth lexicographically smallest
suffix of S in main memory. We give some intuitions on the central notion of work, and
some definitions and notations used in the algorithm. Next, we show how to perform the main
iteration, called phase transition. Finally, we present the invariants that are maintained in
each phase transition, and discuss the correctness and the complexity of our algorithm.

Notation and intuition. Consider the regular linear-time selection algorithm [2], hereafter
called BFPRT. Our algorithm for a string S = S[1..N] uses BFPRT as a black box.¹
Each run of BFPRT permits us to discover a longer and longer prefix of the (unknown) kth
lexicographically smallest suffix of S. We need to carefully orchestrate the several runs of
BFPRT to obtain a total cost of O(N) time. We use S = bbbabbbbbaa$, where n = 12, as
an illustrative example, and show how to find the median suffix (hence, k = n/2 = 6).

Phases and phase transitions. We organize our computation so that it goes through phases,
numbered t = 0, 1, 2, ... and so on. In phase t, we know that a certain string, denoted σ_t,
is a prefix of the (unknown) kth lexicographically smallest suffix of S. Phase t = 0 is the
initial one: we just have the input string S and no knowledge, i.e., σ_0 is the empty string.
For t ≥ 1, a main iteration of our algorithm goes from phase t - 1 to phase t and is termed
phase transition (t - 1 → t): it is built around the tth run of BFPRT on a suitable subset of
the suffixes of S. Note that t ≤ N, since we ensure that the condition |σ_{t-1}| < |σ_t| holds,
namely, each phase transition extends the known prefix by at least one symbol.

¹ In the following, we will assume that the last symbol in S is an endmarker S[N] = $, smaller than any
other symbol in S.
Phase transition (0 → 1). We start out with phase 0, where we run BFPRT on the individual
symbols of S, and find the symbol α of rank k in S (seen as a multiset). Hence we know
that σ_1 = α, and this fact has some implications on the set of suffixes of S. Let s_i denote
the ith suffix S[i..N] of S, for 1 ≤ i ≤ N, and w_i be a special prefix of s_i called work.
We anticipate that the works play a fundamental role in attaining O(N) time. To complete
the phase transition, we set w_i = S[i] for 1 ≤ i ≤ N, and we call degenerate the works w_i
such that w_i ≠ α. (Note that degenerate works are only created in this phase transition.)
We then partition the suffixes of S into two disjoint sets:

• The set of active suffixes, denoted by A_1: they are those suffixes s_i such that w_i =
σ_1 = α.

• The set of inactive suffixes, denoted by I_1 and containing the rest of the suffixes:
none of them can be the kth lexicographically smallest suffix in S.

In our example (k = 6), we have σ_1 = α = b and, for i = 1, 2, 3, 5, 6, 7, 8, 9, w_i = b and
s_i ∈ A_1. Also, we have s_j ∈ I_1 for j = 4, 10, 11, 12, where w_4 = w_10 = w_11 = a and w_12 = $
are degenerate works.
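
The first phase transition is simple enough to be written down directly. The Python sketch below reproduces the partition of the running example; it uses `sorted` as a stand-in for the BFPRT run on the symbols, and the function name `phase_zero_to_one` is hypothetical.

```python
def phase_zero_to_one(s, k):
    """Return (alpha, active, inactive); positions are 1-based as in the text."""
    alpha = sorted(s)[k - 1]   # symbol of rank k (stand-in for a BFPRT run)
    active = [i + 1 for i, c in enumerate(s) if c == alpha]
    inactive = [i + 1 for i, c in enumerate(s) if c != alpha]
    return alpha, active, inactive


if __name__ == "__main__":
    print(phase_zero_to_one("bbbabbbbbaa$", 6))
```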

A comment is in order at this point. We can compare any two works in constant 
time, where the outcome of the comparison is ternary [<,=,>]. While this observation is 
straightforward for this phase transition, we will be able to extend it to longer works in the 
subsequent transitions. Let us discuss the transition from phase 1 to phase 2 to introduce 
the reader to the main point of the algorithm. 

Phase transition (1 → 2). If |A_1| = 1, we are done since there is only one active suffix and
this should be the kth smallest suffix in S. Otherwise, we exploit the total order on the
current works. Letting z_1 be the number of works smaller than the current prefix σ_1, our
goal becomes how to find the (k - z_1)th smallest suffix in A_1. In particular, we want a
longer prefix σ_2 and the new set A_2 ⊆ A_1.

To this end, we need to extend some of the works of the active suffixes in A_1. Consider
a suffix s_i ∈ A_1. In order to extend its work w_i, we introduce its prospective work. Recall
that w_i = σ_1 = α = S[i]. If w_{i+1} = S[i+1] ≠ α (hence, s_{i+1} is inactive in our terminology),
the prospective work for s_i is the concatenation w_i w_{i+1}, where s_{i+1} ∈ I_1. Otherwise, since
w_i = w_{i+1} (and so s_{i+1} ∈ A_1), we consider i + 2, i + 3, and so on, until we find the first
i + r such that w_i ≠ w_{i+r} (and so s_{i+r} ∈ I_1). In the latter case, the prospective work for
s_i is the concatenation w_i w_{i+1} ··· w_{i+r}, where w_i = w_{i+1} = ··· = w_{i+r-1} = σ_1 = α and
their corresponding suffixes are active, while w_{i+r} ≠ σ_1 is different and corresponds to an
inactive suffix.

In any case, each prospective work is a sequence of works of the form α^r β, where
r ≥ 1 and β ≠ α. The reader should convince herself that any two prospective works can
be compared in O(1) time. We exploit this fact by running BFPRT on the set A_1 of active
suffixes and, whenever BFPRT requires to compare any two s_i, s_j ∈ A_1, we compare their
prospective works. Running time is therefore O(|A_1|) if we note that prospective works can
be easily identified by a scan of A_1: if w_i w_{i+1} ··· w_{i+r} is the prospective work for s_i, then
w_{i+1} ··· w_{i+r} is the prospective work for s_{i+1}, and so on. In other words, a consecutive run
of prospective works forms a collision, which is informally a maximal concatenated sequence
of works equal to σ_1 terminated by a work different from σ_1 (this notion will be described
formally later in Section 2).
After BFPRT completes its execution, we know the prospective work that is a prefix of
the (unknown) kth suffix in S. That prospective work becomes σ_2 and A_2 is made up of
the suffixes in A_1 such that their prospective work equals σ_2 (and we also set z_2).

In our example, z_1 = 3, and so we look for the third smallest suffix in A_1. We have
the following prospective works: one collision is made up of p_1 = bbba, p_2 = bba, and
p_3 = ba; another collision is made up of p_5 = bbbbba, p_6 = bbbba, p_7 = bbba, p_8 = bba,
and p_9 = ba. Algorithm BFPRT discovers that bba is the third prospective work among
them, and so σ_2 = bba and A_2 = {s_2, s_8} (and z_2 = 5).
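
The prospective works of this phase transition can be produced by a single right-to-left scan, as the Python sketch below illustrates on the running example. It is a simplified illustration for single-symbol works only: prospective works are materialized as strings and ranked with `sorted`, whereas the algorithm compares them in O(1) time and ranks them with BFPRT. The function name `prospective_works` is hypothetical, and the rank 3 is the one used in the example above.

```python
def prospective_works(s, alpha):
    """Map each active position i (1-based) to its prospective work alpha^r beta."""
    works = {}
    end = None                         # 1-based position of the nearest terminator
    for i in range(len(s), 0, -1):
        if s[i - 1] != alpha:
            end = i                    # an inactive symbol terminates the run
        else:
            works[i] = s[i - 1:end]    # run of alpha's followed by the terminator
    return works


if __name__ == "__main__":
    works = prospective_works("bbbabbbbbaa$", "b")
    third = sorted(works.values())[2]
    survivors = sorted(i for i, p in works.items() if p == third)
    print(works, third, survivors)
```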

How to maintain the works. Now comes the key point in our algorithm. For each suffix
s_i ∈ A_2, we update its work to be w_i = σ_2 (whereas it was w_i = σ_1 in the previous phase
transition, so it is now longer). For each suffix s_i ∈ A_1 - A_2, instead, we leave its work w_i
unchanged. Note this is the key point: although s_i can share a longer prefix with σ_2, the
algorithm BFPRT has indirectly established that s_i cannot have σ_2 as a prefix, and we just
need to record a Boolean value for w_i, indicating if w_i is either lexicographically smaller
or larger than σ_2. We can stick to w_i unchanged, and discard its prospective work, since
s_i becomes inactive and is added to I_2. In our example, w_2 = w_8 = bba, while the other
works are unchanged (i.e., w_3 = b while p_3 = ba, w_5 = b while p_5 = bbbbba, and so on).

In this way, we can maintain a total order on the works. If two works are of equal length, 
we declare that they are equal according to the symbol comparisons that we have performed 
so far, unless they are degenerate: in the latter case they can be easily compared as single
symbols. If two works are of different length, say |w_i| < |w_j|, then s_i has been discarded by
BFPRT in favor of s_j in a certain phase, so we surely know which one is smaller or larger.
In other words, when we declare two works to be equal, we have not yet gathered enough 
symbol comparisons to distinguish among their corresponding suffixes. Otherwise, we have 
been able to implicitly distinguish among their corresponding suffixes. In our example, 
"Ws < W2 because they are of different length and bfprt has established this disequality, 
while we declare that W3 = since they have the same length. Recall that the total order 
on the works is needed for comparing any two prospective works in O (1) time as we proceed 
in the phase transitions. The works exhibit some other strong properties that we point out 
in the invariants described later in Section 2.

Time complexity. From the above discussion, we spend O(|A_1|) time for phase transition
(1 → 2). We present a charging scheme to pay for that; works come again into play for
an amortized cost analysis. Suppose that, in phase 0, we initially assign each suffix s_i two
kinds of credits to be charged as follows: O(1) credits of the first kind when s_i becomes
inactive, and further O(1) credits of the second kind when s_i is already inactive but its
work w_i becomes the terminator of the prospective work of an active suffix. Note that w_i
is encapsulated by the prospective work of that suffix (which survives and becomes part of
A_2).

Now, when executing BFPRT on A_1 as mentioned above, we have that at most one
prospective work survives in each collision and the corresponding suffix becomes part of
A_2. We therefore charge the cost O(|A_1|) as follows. We take Θ(|A_1| - |A_2|) credits of
the first kind from the |A_1| - |A_2| > 0 active suffixes that become inactive at the end of
the phase transition. We also take Θ(|A_2|) credits from the |A_2| inactive suffixes whose
work terminates the prospective work of the survivors. In our example, the Θ(|A_1| - |A_2|)
credits are taken from s_1, s_3, s_5, s_6, s_7, and s_9, while Θ(|A_2|) credits are taken from s_4 and
s_10.
At this point, it should be clear that, in our example, the next phase transition (2 → 3)
looks for the (k - z_2)th smallest suffix in A_2 by executing BFPRT in O(|A_2|) time on the
prospective works built with the runs of consecutive occurrences of the work σ_2 = bba in
S. We thus identify bbaa$ (with σ_3 = bbaa) as the median suffix in S.

Phase transition (t - 1 → t) for t ≥ 1. We are now ready to describe the generic phase
transition (t - 1 → t) more formally in terms of the active suffixes in A_{t-1} and the inactive
ones in I_{t-1}, where t ≥ 1.

The input for the phase transition is the following: (a) the current prefix σ_{t-1} of the
(unknown) kth lexicographically smallest suffix in S; (b) the set A_{t-1} of currently active
suffixes; (c) the number z_{t-1} of suffixes in I_{t-1} whose work is smaller than that of the
suffixes in A_{t-1} (hence, we have to find the (k - z_{t-1})th smallest suffix in A_{t-1}); and (d) a
Boolean vector whose ith element is false (resp., true) iff, for suffix s_i ∈ I_{t-1}, the algorithm
BFPRT has determined that its work w_i is smaller (resp., larger) than σ_{t-1}. The output of
the phase transition consists of the data (a)-(d) above, updated for phase t.

We now define collisions and prospective works in a formal way. We say that two
suffixes s_i, s_j ∈ A_t collide if their works w_i and w_j are adjacent as substrings in S, namely,
|i - j| = |w_i| = |w_j|. A collision C is the maximal subsequence w_{l_1} w_{l_2} ··· w_{l_r}, such that
w_{l_1} = w_{l_2} = ··· = w_{l_r} = σ_t, where the active suffixes s_{l_f} and s_{l_{f+1}} collide for any 1 ≤ f < r.
For our algorithm, a collision can also be a degenerate sequence of just one active suffix s_i
(since its work does not collide with that of any other active suffix).

The prospective work of a suffix s_i ∈ A_{t-1}, denoted by p_i, is defined as follows. Consider
the collision C to which s_i belongs. Suppose that s_i is the fth active suffix (from the left) in
C, that is, C = w_{l_1} w_{l_2} ··· w_{l_{f-1}} w_i w_{l_{f+1}} ··· w_{l_{r-1}} w_{l_r}. Consider the suffix s_u ∈ I_{t-1} adjacent
to w_{l_r} (because of the definition of collision, s_u must be an inactive suffix following w_{l_r}).
We define the prospective work of s_i to be the string p_i = w_i w_{l_{f+1}} ··· w_{l_{r-1}} w_{l_r} w_u. Note
that w_i = w_{l_{f+1}} = ··· = w_{l_{r-1}} = w_{l_r} = σ_{t-1} since their corresponding suffixes are all active,
while w_u is shorter. In other words, p_i = σ_{t-1}^{r-f+1} w_u, with |w_u| < |σ_{t-1}|.

Lemma 2.1. For any two suffixes s_i, s_j ∈ A_t, we can compare their prospective works p_i
and p_j in O(1) time.

We now give the steps for the phase transition. Note that we can maintain A_{t-1} in
monotone order of suffix position (i.e., i < j implies that s_i comes before s_j in A_{t-1}).

(1) Scan the active set A_{t-1} and identify its collisions and the set of all the
suffixes s_u ∈ I_{t-1} such that w_u immediately follows a collision. For any suffix s_i in
A_{t-1}, determine its prospective work p_i using the collisions and that set.

(2) Apply algorithm BFPRT to the set {p_i : s_i ∈ A_{t-1}} using the constant-time comparison
as stated in Lemma 2.1. In this way, find the (k - z_{t-1})th lexicographically smallest
prospective work p, and the corresponding set A_t = {s_i ∈ A_{t-1} | p_i = p} of active
suffixes whose prospective works match p.

(a) If |A_t| = 1, stop the computation and return the singleton s_i ∈ A_t as the kth
smallest suffix in S.

(b) If |A_t| > 1, set σ_t = p (and update z_t accordingly).

(3) For each s_i ∈ A_t: let p = w_i w_{l_{f+1}} ··· w_{l_r} w_u be its prospective work, where s_u ∈ I_{t-1}.
Set its new work to be w_i = p = σ_t.

(4) For each s_j ∈ A_{t-1} - A_t, leave its work w_j unchanged and, as a byproduct of running
BFPRT in step 2, update position j of the Boolean vector (d) given in input, so as
to record the fact that w_j is lexicographically smaller or larger than σ_t.
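
The bookkeeping of the active set and of the residual rank k - z can be illustrated with a deliberately simplified refinement loop. The Python sketch below is not the work-based algorithm described above: it extends the known prefix by exactly one symbol per phase and uses `sorted` in place of BFPRT, so it can take quadratic time. It only shows how the active set shrinks and how the rank is adjusted from phase to phase. The function name is hypothetical, and the string is assumed to end with a unique smallest endmarker.

```python
def select_suffix_by_phases(s, k):
    """Return the 1-based start of the k-th smallest suffix of s (naive phases)."""
    active = list(range(len(s)))   # 0-based starts of the active suffixes
    rank = k                       # rank of the wanted suffix among the active ones
    depth = 0                      # length of the prefix known so far
    while len(active) > 1:
        # Next symbol of every active suffix; the unique endmarker guarantees
        # that i + depth stays inside s while two or more suffixes are active.
        symbols = sorted(s[i + depth] for i in active)
        pivot = symbols[rank - 1]
        rank -= sum(1 for c in symbols if c < pivot)   # suffixes now known smaller
        active = [i for i in active if s[i + depth] == pivot]
        depth += 1
    return active[0] + 1


if __name__ == "__main__":
    print(select_suffix_by_phases("banana$", 3))
```

The point of the works and of the prospective works is precisely to avoid this one-symbol-at-a-time behaviour, so that the total cost can be charged to suffixes that become inactive.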
Lemma 2.2. Executing phase transition (t - 1 → t) with t ≥ 1, requires O(|A_{t-1}|) time in
the worst case.

Invariants for phase t. Before proving the correctness and the complexity of our algorithm,
we need to establish some invariants that are maintained through the phase transitions.
We say that w_i is maximal if there does not exist another suffix s_j such that w_j
contains w_i, namely, such that j < i and i + |w_i| ≤ j + |w_j|. For any t ≥ 1, the following
invariants hold (where A_0 is trivially the set of all the suffixes):

(i) [prefixes]: σ_{t-1} and σ_t are prefixes of the (unknown) kth smallest suffix of S, and
|σ_{t-1}| < |σ_t|.

(ii) [works]: For any suffix s_i, its work w_i is either degenerate (a single mismatching
symbol) or w_i = σ_{t'} for a phase t' ≤ t. Moreover, w_i = σ_t iff s_i ∈ A_t.

(iii) [comparing]: For any s_i and s_j, |w_i| ≠ |w_j| implies that we know whether w_i < w_j
or w_i > w_j.

(iv) [nesting]: For any two suffixes s_i and s_j, their works w_i and w_j do not overlap
(either they are disjoint or one is contained within the other). Namely, i > j implies
i + |w_i| ≤ j + |w_j| or i ≥ j + |w_j|.

(v) [covering]: The works of the active suffixes are all maximal and, together with the
maximal works generated by the inactive suffixes, form a non-overlapping covering
of S (i.e. S = w_{i_1} w_{i_2} ··· w_{i_r}, where i_1 < i_2 < ··· < i_r and either s_{i_j} ∈ A_t, or s_{i_j} ∈ I_t
and w_{i_j} is maximal, for 1 ≤ j ≤ r).
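
Invariant (iv) is easy to check mechanically when works are represented as (start, length) pairs. The Python sketch below is a quadratic-time checker meant only for experimenting with small examples; the function name and the dictionary representation are assumptions of this sketch.

```python
def works_are_nested(works):
    """works maps a 1-based start position to the length of that suffix's work.

    Returns True iff any two works are either disjoint or one contains the other.
    """
    items = sorted(works.items())
    for a, (j, len_j) in enumerate(items):
        for i, len_i in items[a + 1:]:          # here i > j
            contained = i + len_i <= j + len_j  # w_i lies inside w_j
            disjoint = i >= j + len_j           # w_i starts after w_j ends
            if not (contained or disjoint):
                return False
    return True


if __name__ == "__main__":
    # Works of the running example after phase 2: w_2 = w_8 = bba, all others
    # are single symbols.
    works = {i: 1 for i in range(1, 13)}
    works[2] = works[8] = 3
    print(works_are_nested(works))
```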



Lemma 2.3. After phase transition (t - 1 → t) with t ≥ 1, the invariants (i)-(v) are
maintained.

Theorem 2.4. The algorithm terminates in a phase t ≤ N, and returns the kth lexicographically
smallest suffix.

Theorem 2.5. Our suffix selection algorithm requires O (N) time in the worst case. 

This simpler suffix selection algorithm is still cache "unfriendly". For example, it requires
O(N) block transfers with a string S with period length Θ(B) (if S is a prefix of g^i
for some integer i, then g is a period of S).
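
The definition of period used here translates directly into code. The following Python sketch checks whether a string g is a period of s and finds the shortest one by brute force; both function names are hypothetical and no attempt is made at efficiency.

```python
def is_period(s, g):
    """True iff s is a prefix of g repeated enough times (g^i for some i)."""
    if not g:
        return False
    reps = -(-len(s) // len(g))          # ceil(len(s) / len(g))
    return (g * reps).startswith(s)


def smallest_period(s):
    """Shortest prefix of the non-empty string s that is a period of s."""
    for length in range(1, len(s) + 1):
        if is_period(s, s[:length]):
            return s[:length]


if __name__ == "__main__":
    print(smallest_period("abcabcab"))
```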

3. Cache-Aware Sparse Suffix Selection

In the sparse suffix selection problem, along with the string S and the rank k of the
suffix to retrieve, we are also given a set K of suffixes such that |K| = O(N/B) and the
kth smallest suffix belongs to K. We want to find the wanted suffix in O(N/B) block
transfers using the ideas of the algorithm described in Section 2.

Consider first a particular situation in which the suffixes are equally spaced B positions
from each other. We can split S into blocks of size B, so that S is conceptually a string of N/B
metacharacters and each suffix starts with a metacharacter. This is a fortunate situation
since we can apply the algorithm described in Section 2 as is, and solve the problem in the
claimed bound. The nontrivial case is when the suffixes can be in arbitrary positions.
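
The equally spaced case can be illustrated as follows. The Python sketch below cuts S into metacharacters of B symbols and ranks the block-aligned suffixes as sequences of metacharacters. For brevity it uses `sorted` rather than the selection algorithm of Section 2, so it only illustrates the change of alphabet, not the claimed bound; the function names are hypothetical.

```python
def metacharacters(s, B):
    """Split s into consecutive blocks of B symbols (the last may be shorter)."""
    return [s[i:i + B] for i in range(0, len(s), B)]


def kth_block_aligned_suffix(s, B, k):
    """1-based start in s of the k-th smallest suffix that begins at a block boundary."""
    meta = metacharacters(s, B)
    # Each candidate suffix is a sequence of metacharacters compared blockwise.
    order = sorted(range(len(meta)), key=lambda j: meta[j:])
    return order[k - 1] * B + 1


if __name__ == "__main__":
    print(metacharacters("bbbabbbbbaa$", 3))
    print(kth_block_aligned_suffix("bbbabbbbbaa$", 3, 3))
```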

Hence, we revisit the algorithm described in Section 2 to make it more cache efficient.
Instead of trying to extend the work of an active suffix s_i by just using the works of the
following inactive suffixes, we try to batch these works in a sufficiently long segment, called
reach. Intuitively, in a step similar to step 2 of the algorithm in Section 2, we could first
apply the BFPRT algorithm to the set of reaches. Then, after we select a subset of equal
reaches, and the corresponding subset of active suffixes, we could extend their works using
their reaches. This could cause collisions between the suffixes and they could be managed in
a way similar to what we did in Section 2. This yields the notion of super-phase transition.

Super-phase transition. The purpose of a super-phase is to group consecutive phases
together, so that we maintain the same invariants as those defined in Section 2. However,
we need further concepts to describe the transition between super-phases. We number the 
super-phases according to the numbering of phases. We call a super-phase m if the first 
phase in it is m (in the overall numbering of phases). 

Reaches, pseudo-collisions and prospective reaches. Consider a generic super-phase m. Recall
that, by the invariant (v) in Section 2, the phase transitions maintain the string S
partitioned into maximal works. We need to define a way to access enough (but not too
many) consecutive "lookahead" works following each active suffix, before running the super-phase.
Since some of these active suffixes will become inactive during the phases that form
the super-phase, we cannot prefetch too many such works (and we cannot predict which
ones will be effectively needed). This idea of prefetching leads to the following notion.

For any active suffix s_i ∈ A_m, the reach of s_i, denoted by r_i, is the maximal sequence
of consecutive works w_{l_1} w_{l_2} ··· w_{l_f} such that

(i) i < l_1 < l_2 < ··· < l_f and l_f - i < B;

(ii) w_i and w_{l_1} are adjacent and, for 1 < x ≤ f, w_{l_{x-1}} and w_{l_x} are adjacent in S;

(iii) if s_j is the leftmost active suffix in S[i + 1..N], then l_f ≤ j.

We call a reach full if l_f < j in condition (iii), namely, we do not meet an active suffix
while loading the reach. Since we know how to compare two works, we also know how to
compare any two reaches r_i, r_j, seen as sequences of works. We have the following.

Lemma 3.1. For any two reaches r_i and r_j, such that |r_i| < |r_j|, we have that r_i cannot
be a prefix of r_j.

Using reaches, we must possibly handle the collisions that may occur in an arbitrary
phase that is internal to the current super-phase. We therefore introduce a notion of collision
for reaches that is called pseudo-collision because it does not necessarily imply a collision.

For any two reaches r_i, r_j such that i < j, we say that r_i and r_j pseudo-collide if r_i = r_j
and the last work of r_i is w_j itself (not just equal to w_j). Thus, the last work of r_j is
active and equal to w_i and w_j. Certainly, the fact that r_i and r_j pseudo-collide during
a super-phase does not necessarily imply that the works w_i and w_j collide in one of its
phases. A pseudo-collision PC is a maximal sequence r_{l_1} r_{l_2} ··· r_{l_a} such that r_{l_f} and r_{l_{f+1}}
pseudo-collide, for any 1 ≤ f < a. For our algorithm, a degenerate pseudo-collision is a
sequence of just one reach.

Let us consider an active suffix s_i and the pseudo-collision to which r_i belongs. Let
us suppose that the pseudo-collision is r_{l_1} r_{l_2} ··· r_{l_{f-1}} r_i r_{l_{f+1}} ··· r_{l_a} (i.e. r_i is the fth reach).
Also, let us consider the reach r_u of the last work w_u that appears in r_{l_a} (by the definition
of pseudo-collision, we know that the last work w_u of r_{l_a} is equal to its first work, so s_u is
active and has a reach). The prospective reach of an active work w_i, denoted by pr_i, is the
sequence r_i r_{l_{f+1}} ··· r_{l_a} tail(pr_i), where tail(pr_i) = lcp(r_i, r_u) is the tail of pr_i and denotes
the longest initial sequence of works that is common to both r_i and r_u. Analogously to
prospective works, we can define a total order on the prospective reaches. The multiplicity
of pr_i, denoted by mult(pr_i), is a - f + 1 (that is the number of reaches following r_i in the
pseudo-collision plus r_i).
Lemma 3.2. If the invariants for the phases hold for the current super-phase then, for any
two reaches r_i and r_j such that r_i = r_j, we have that their prospective reaches pr_i and pr_j
can be compared in O(1) time, provided we know the lengths of tail(pr_i) and tail(pr_j).

Super-phase transition (m → m'). The transition from a super-phase m to the next super-phase
m' emulates what happens with phases m, m + 1, ..., m' in the algorithm of Section 2
but using O(N/B) block transfers.

(1) For each active suffix s_i, we create a pointer to its reach r_i.

(2) We find the (k - z_m)th lexicographically smallest reach ρ using BFPRT on the
O(N/B) pointers to reaches created in the previous step. The sets K_= = {s_i | s_i is
active and r_i = ρ}, K_< = {s_i | s_i is active and r_i < ρ}, and K_> = {s_i | s_i is active
and r_i > ρ} are thus identified, and, for any s_i ∈ K_< ∪ K_>, the length of lcp(r_i, ρ).²
If |K_=| = 1, we stop and return s_i, such that s_i ∈ K_=, as the kth smallest suffix
in S.

(3) For any s_i ∈ K_=, we compute its prospective reach pr_i.

(4) We find the (k - z_m - |K_<|)th lexicographically smallest prospective reach π among
the ones in {pr_i | s_i ∈ K_=}, thus obtaining K_= = {s_i | s_i is active and pr_i = π},
K_< = {s_i | s_i is active and pr_i < π}, K_> = {s_i | s_i is active and pr_i > π}, and, for
any s_i ∈ K_< ∪ K_>, the length of lcp(pr_i, π). If |K_=| = 1, we stop and return s_i,
such that s_i ∈ K_=, as the kth smallest suffix in S.

Theorem 3.3. The sparse suffix selection problem can be solved using O(N/B) block transfers
in the worst case.

4. Optimal Cache-Aware Suffix Selection

The approach in Sec. 3 does not work if the number of input active suffixes is ω(N/B).
The process would cost O((N/B) log B) block transfers (since it would take Ω(log B) transitions
to finally have O(N/B) active suffixes left). However, if we were able to find a set K of
O(N/B) suffixes such that one of them is the kth smallest, we could solve the problem
with O(N/B) block transfers using the algorithm in Sec. 3. In this section we show how to
compute such a set K.

Basically, we consider all the substrings of length B of S and we select a suitable set
of p > B pivot substrings that are roughly evenly spaced. Then, we find the pivot that is
lexicographically "closest" to the wanted kth suffix and one of the following two situations arises:

• We are able to infer that the kth smallest suffix is strictly between two consecutive
pivots (that is, its corresponding substring of B characters is strictly greater than the first
and strictly smaller than the second of the two pivots). In this case, we return all the
O(N/p) = O(N/B) suffixes that are contained between the two pivots.

• We can identify the suffixes that have the first B characters equal to those of the kth
smallest suffix. We show that, in case they are still Ω(N/B) in number, they must satisfy
some periodicity property, so that we can reduce them to just O(N/B) with additional
O(N/B) block transfers.

² Given strings S and T, their longest common prefix lcp(S, T) is the longest string U such that both S and
T start with U.
4.1. Finding pivots and the key suffixes

Let p = \J for a suitable constant c > 1. We proceed with the following steps.

First. We sort the first M'^ substrings of length B of S (that is substrings 5[1 ■ ■ ■ B], 
S[2...B + 1],..., ^[M^ -1...B + W-2], S[W ...B + M"- 1]). Then we sort the second 
M'^ substrings of length B and so forth until all the N positions in S have been considered. 
The product of this step is an array V of N pointers to the substrings of length B of S.

Second. We scan V and we collect in an array U of N/p positions the N/p pointers
V[p], V[2p], V[3p], ....
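
The first two steps can be sketched in Python as follows. The group size `chunk` and the sampling distance `p` are plain parameters of this sketch (hypothetical stand-ins for the values fixed in the text), substrings that run past the end of S are simply truncated, and positions are 0-based. The sketch shows what is computed, not how to do it within the block-transfer bound.

```python
def chunk_sort_and_sample(s, B, chunk, p):
    """Return (V, U): group-wise sorted pointers and every p-th pointer of V."""
    def key(i):
        return s[i:i + B]              # substring of length B starting at i

    V = []
    for lo in range(0, len(s), chunk):
        hi = min(lo + chunk, len(s))
        # Sort the pointers of this group only, by the substrings they denote.
        V.extend(sorted(range(lo, hi), key=key))
    U = V[p - 1::p]                    # V[p], V[2p], ... in 1-based terms
    return V, U


if __name__ == "__main__":
    print(chunk_sort_and_sample("bbbabbbbbaa$", 2, 4, 2))
```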

Third. We (multi)-select from U the p pointers to the substrings (of length B) b_1, ..., b_p
such that b_i has rank i N/p^2 among the substrings (pointed by the pointers) in U. These are
the pivots we were looking for. We store the p (pointers to the) pivots in an array U'.

Fourth. We need to find the rightmost pivot b_x such that the number of substrings
(of length B of S) lexicographically smaller than b_x is less than k (the rank of the wanted
suffix). We cannot simply distribute all the substrings of length B according to all the p 
pivots in U' , because it would be too costly. Instead, we proceed with the following refining 
strategy. 

1. From the p pivots in U' we extract the group G_1 of δM equidistant pivots, where
δ < 1 is a suitable constant (i.e. the pivots b_t, b_2t, ..., where t = p/(δM)). Then, for any
b_j ∈ G_1, we find out how many substrings of size B are lexicographically smaller
than b_j. After that we find the rightmost pivot b_x1 ∈ G_1 such that the number of
substrings (of length B) smaller than b_x1 is less than k.

2. From the pivots in U' following b_x1 we extract the group G_2 of δM equidistant
pivots. Then, for any b_j ∈ G_2, we find out how many substrings of size B are smaller
than b_j. After that we find the rightmost pivot b_x2 ∈ G_2 such that the number of
substrings smaller than b_x2 is less than k.

More generally: 

f. Let G_f be the δM pivots in U' following b_x(f−1). Then, for any b_j ∈ G_f, we find out
how many substrings of size B are smaller than b_j. After that we find the rightmost
pivot b_xf ∈ G_f such that the number of substrings smaller than b_xf is less than k.

The pivot b_xf found in the last iteration is the pivot b_x we are looking for in this step.
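
The refining strategy can be sketched as a search that tests only a small group of equidistant pivots per round, each round costing one scan of S. The code below is a hedged illustration, not the paper's procedure: `pivots` is assumed to be a sorted list of pivot strings, `g` is a hypothetical parameter playing the role of the group size δM, positions are 0-based, and `count_smaller` and `rightmost_pivot` are made-up names.

```python
def count_smaller(S, B, group):
    """One scan of S: for each pivot in group, count the windows smaller than it."""
    counts = [0] * len(group)
    for i in range(len(S)):
        w = S[i:i + B]
        for j, b in enumerate(group):
            if w < b:
                counts[j] += 1
    return counts


def rightmost_pivot(S, B, pivots, k, g):
    """Index of the rightmost pivot with fewer than k smaller windows, or -1."""
    lo, hi = 0, len(pivots)     # the answer, if any, lies in pivots[lo:hi] or is `best`
    best = -1
    while lo < hi:
        step = max(1, (hi - lo) // g)
        group = list(range(lo + step - 1, hi, step))   # equidistant candidates
        counts = count_smaller(S, B, [pivots[j] for j in group])
        for j, c in zip(group, counts):
            if c < k:
                best, lo = j, j + 1    # still good: move the left end past j
            else:
                hi = j                 # first bad pivot: shrink the right end
                break
    return best
```

Because the pivots are sorted, the counts are non-decreasing, so each round narrows the candidate range to fewer than `step` pivots.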

Fifth. We scan S and compute the following two numbers: the number n^< of substrings
of length B lexicographically smaller than b_x, and the number n^= of substrings equal to b_x.

Sixth. In this step we treat the following case: n^< < k ≤ n^< + n^=. More specifically,
this implies that the wanted kth smallest suffix has its prefix of B characters equal to b_x.
We proceed as follows. We scan S and gather in a contiguous zone R (the indexes of)
the suffixes of S having their prefixes of B characters equal to b_x. In this case we have
already found the key suffixes (whose indexes reside in R). Therefore the computation in
this section ends here and we proceed to discard some of them (Sec. 4.2).

Seventh. In this step we treat the remaining case: n^< + n^= < k. In other
words, in this case we know that the prefix of B characters of the wanted kth smallest
suffix is (lexicographically) greater than b_x and smaller than b_(x+1). Therefore, we scan S
and gather in a contiguous zone R (the indexes of) the suffixes of S having their prefix
of B characters greater than b_x and smaller than b_(x+1). Since there are fewer than N/B
such suffixes (see Lemma 4.1 below), we have already found the set of sparse active suffixes
(whose indexes reside in R) that will be processed in Sec. 3.
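
The fifth, sixth and seventh steps amount to two counts followed by one gathering scan. A minimal in-memory sketch, assuming 0-based positions, a 1-based rank k, truncated windows at the end of S, and a pivot `b_x` already chosen so that fewer than k windows are smaller than it; the function name and the convention of returning the number of discarded smaller suffixes are illustrative assumptions:

```python
def gather_key_suffixes(S, B, k, b_x, b_next=None):
    """Return (R, smaller): the gathered positions R and the number of
    suffixes known to be smaller than every suffix in R."""
    windows = [S[i:i + B] for i in range(len(S))]
    n_lt = sum(w < b_x for w in windows)     # windows smaller than b_x
    n_eq = sum(w == b_x for w in windows)    # windows equal to b_x
    if n_lt < k <= n_lt + n_eq:
        # Sixth step: the kth suffix starts with b_x.
        return [i for i, w in enumerate(windows) if w == b_x], n_lt
    # Seventh step: the kth suffix lies strictly between b_x and the next pivot.
    R = [i for i, w in enumerate(windows)
         if b_x < w and (b_next is None or w < b_next)]
    return R, n_lt + n_eq


R, smaller = gather_key_suffixes("abcabd", 2, 3, "ab", "ca")
```

In both cases the wanted suffix is the one of rank k minus `smaller` among the suffixes whose positions are in R.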






Lemma 4.1. For any S and k, either the number of key suffixes found is O (N/B), or their 
prefixes of B characters are all the same. 

Lemma 4.2. Under the tall-cache assumption, finding the key suffixes needs O(N/B) block
transfers in the worst case.

4.2. Discarding key suffixes 

Finally, let us show how to reduce the number of key suffixes gathered in Sec. 4.1 to
fewer than 2N/B so that we can pass them to the sparse suffix selection algorithm (Sec. 3). Let us
assume that the number of key suffixes is greater than 2N/B.

The indexes of the key suffixes have been previously stored in an array R. Clearly, the
kth smallest suffix is among the ones in R. We also know the number n^< of suffixes of S
that are lexicographically smaller than each suffix in R. Finally, we know that there exists
a string q of length B such that R contains all and only the suffixes s_i such that the prefix
of length B of s_i is equal to q (i.e. R contains the indexes of all the occurrences of q in S).

To achieve our goal we exploit the possible periodicity of the string q. A string u is a
period of a string v (|u| ≤ |v|) if v is a prefix of u^i for some integer i ≥ 1. The period of v
is the smallest of its periods. We exploit the following:

Property 1 ([8]). If q occurs in two positions i and j of S and 0 < j − i < |q| then q has
a period of length j − i.
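
The definition of period and Property 1 can be checked on small inputs with a few lines of code. This is a naive quadratic sketch for illustration only; the helper names are hypothetical, positions are 0-based, and `smallest_period` assumes a nonempty string.

```python
def has_period(v, d):
    """True if v is a prefix of a repetition of its first d characters."""
    return all(v[t] == v[t - d] for t in range(d, len(v)))


def smallest_period(v):
    """Length of the shortest period of a nonempty string v."""
    return next(d for d in range(1, len(v) + 1) if has_period(v, d))


def occurrences(S, q):
    """All start positions of q in S."""
    return [i for i in range(len(S) - len(q) + 1) if S[i:i + len(q)] == q]


S, q = "abababab", "abab"
occ = occurrences(S, q)
# Property 1: any two overlapping occurrences at distance d force period d on q.
overlaps_ok = all(has_period(q, j - i)
                  for i in occ for j in occ if 0 < j - i < len(q))
```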

Let u be the period of q. Since the number of suffixes in R is greater than 2N/B, there
must be some overlapping between the occurrences of q in S. Therefore, by Property 1 we
can conclude that |u| < |q|. For the sake of presentation let us assume that |q| is not a
multiple of |u| (the other case is analogous).

From how R has been built (by left to right scanning of S) we know that the indexes in it
are in increasing order, that is R[i] < R[i+1], for any i (i.e. the indexes in R follow the order,
from left to right, in which the corresponding suffixes may be found in S). Let us consider
a maximal subsequence R_i of R such that, for any 1 ≤ j < |R_i|, R_i[j+1] − R_i[j] ≤ B/2
(i.e. the occurrence of q in S starting in position R_i[j] overlaps the one starting in position
R_i[j+1] by at least B/2 positions). Clearly, any two of these subsequences of R do not
overlap and hence R can be seen as the concatenation R_1 R_2 ... of these subsequences. From
the definition of the partitioning of R and from the periodicity of q we have:

Lemma 4.3. The following statements hold: 

(i) There are fewer than 2N/B such subsequences.

(ii) For any R_i, the substring S[R_i[1] ... R_i[|R_i|] + B − 1] (the substring of S spanned by
the substrings whose indexes are in R_i) has period u.

(iii) The substring of length B of S starting in position R_i[|R_i|] + |u| (the substring
starting one period-length past the rightmost member of R_i) is not equal to q.

For any key suffix s_j, let us consider the following prefix: ps_j = S[j ... R_i[|R_i|] + |u| +
B − 1], where R_i is the subsequence of R to which (the index of) s_j belongs. By Lemma 4.3
we know two things about ps_j: (a) the prefix of length |ps_j| − |u| of ps_j has period u; (b)
the suffix of length B of ps_j is not equal to q.

In light of this, we associate with any key suffix s_j a pair of integers (α_j, β_j) defined as
follows: α_j is equal to the number of complete periods u in the prefix of length |ps_j| − |u|
of ps_j; β_j is equal to R_i[|R_i|] + |u| (that is, the index of the substring of length B starting one
period-length past the rightmost member of R_i).
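
The partition of R into maximal subsequences and the pairs (α_j, β_j) reduce to simple arithmetic on positions. The sketch below assumes 0-based positions (the formulas are unchanged because they only involve differences and offsets), takes the period length |u| as the hypothetical parameter `u_len`, and uses made-up function names. Since the prefix of length |ps_j| − |u| spans from j to R_i[|R_i|] + B − 1, its length is R_i[|R_i|] + B − j.

```python
def split_runs(R, B):
    """Split the increasing list R into maximal runs whose consecutive
    members are at distance at most B/2."""
    runs = []
    for pos in R:
        if runs and 2 * (pos - runs[-1][-1]) <= B:
            runs[-1].append(pos)
        else:
            runs.append([pos])
    return runs


def alpha_beta_pairs(R, B, u_len):
    """Pairs (alpha_j, beta_j) for the key suffixes in R, in the order of R."""
    pairs = []
    for run in split_runs(R, B):
        last = run[-1]                         # rightmost member of the run
        for j in run:
            alpha = (last + B - j) // u_len    # complete periods in the periodic prefix
            beta = last + u_len                # one period past the rightmost member
            pairs.append((alpha, beta))
    return pairs


# Toy input: q = "ababa" (B = 5, |u| = 2) occurs at 0, 2, 4 and 10 in S.
S = "ababababaxababa"
pairs = alpha_beta_pairs([0, 2, 4, 10], 5, 2)
```

On this toy input the two runs are [0, 2, 4] and [10], and the windows of length 5 starting at positions 6 and 12 indeed differ from q.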
There is a natural total order ⊴ that can be defined over the key suffixes. It is based on
the pairs of integers (α_j, β_j) and it is defined as follows. For any two key suffixes s_j' and s_j'':

• If α_j' = α_j'' then s_j' and s_j'' are equal (according to ⊴).

• If α_j' < α_j'' then s_j' ⊴ s_j'' iff S[β_j' ... β_j' + B − 1] is lexicographically smaller than q.

By Lemma 4.3 we know that the suffix of length B of ps_j' (which is the substring S[β_j' ... β_j' +
B − 1]) is not equal to q. Therefore the total order ⊴ is well defined.

We are now ready to describe the process for reducing the number of key suffixes. We 
proceed with the following steps. 

First. By scanning S and R, we compute the pair (α_j, β_j) for any key suffix s_j. The
pairs are stored in an array (of pairs of integers) Pairs.

Second. We scan S and compute the array Comp of N positions defined as follows: for
any 1 ≤ i ≤ N, Comp[i] is equal to −1, 0 or 1 if S[i ... i + B − 1] is less than, equal to or
greater than q, respectively (the array Comp tells us what is the result of the comparison
of q with any substring of size B different from it).

Third. By scanning Pairs and Comp at the same time, we compute the array PComp
of size |Pairs|, such that, for any l, PComp[l] = Comp[Pairs[l].β] (where Pairs[l].β is the
second member of the pair of integers in position l of Pairs).

Fourth. Using Pairs and PComp, we select the (k − n^<)-th smallest key suffix s_x and
all the key suffixes equal to s_x according to the total order ⊴ (where n^< is the number of
suffixes of S that are lexicographically smaller than each suffix in R, known since Sec. 4.1).
The set of the selected key suffixes is the output of the process.
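
The arrays Comp and PComp, and the comparison they enable, can be sketched as follows. This is a literal reading of the rule stated above and not the selection routine itself: positions are 0-based, the array Pairs is assumed to be given as a list of (α, β) tuples, the function names are hypothetical, and nothing here models block transfers.

```python
def comp_array(S, q):
    """Comp[i] is -1, 0 or 1 if the window S[i:i+|q|] is <, == or > q."""
    B = len(q)
    out = []
    for i in range(len(S)):
        w = S[i:i + B]
        out.append(-1 if w < q else (0 if w == q else 1))
    return out


def pcomp_array(pairs, comp):
    """PComp[l] = Comp[Pairs[l].beta]."""
    return [comp[beta] for _, beta in pairs]


def compare(a, b, pairs, pcomp):
    """Three-way comparison of the key suffixes with indices a and b in Pairs."""
    alpha_a, alpha_b = pairs[a][0], pairs[b][0]
    if alpha_a == alpha_b:
        return 0
    if alpha_a < alpha_b:
        # The suffix with fewer periods decides: it precedes iff its window is < q.
        return -1 if pcomp[a] == -1 else 1
    return 1 if pcomp[b] == -1 else -1


# Toy input: q occurs at positions R in S; the pairs are those of the earlier definition.
S, q, R = "ababababaxababa", "ababa", [0, 2, 4, 10]
pairs = [(4, 6), (3, 6), (2, 6), (2, 12)]
pcomp = pcomp_array(pairs, comp_array(S, q))
```

On this toy input, whenever two key suffixes have different α the comparison agrees with their true lexicographic order, using only the two small arrays instead of the suffixes themselves.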

Lemma 4.4. At the end of the discarding process, the selected key suffixes are fewer than
2N/B in number and the kth lexicographically smallest suffix is among them.

Lemma 4.5. The discarding process requires O(N/B) block transfers in the worst case.

Theorem 4.6. The suffix selection problem for a string defined over a general alphabet can
be solved using O(N/B) block transfers in the worst case.